Convolutional Neural Networks Applied to 
House Numbers Digit Classification 
Abstract

We classify digits of real-world house numbers using
convolutional neural networks (ConvNets). ConvNets are
hierarchical feature-learning neural networks whose
structure is biologically inspired. Unlike many popular
vision approaches that are hand-designed, ConvNets can
automatically learn a unique set of features optimized
for a given task. We augment the traditional ConvNet
architecture by learning multi-stage features and by
using Lp pooling, and establish a new state of the art
of 94.85% accuracy on the SVHN dataset (a 45.2%
relative reduction in error). Furthermore, we analyze
the benefits of different pooling methods and
multi-stage features in ConvNets. The source code and
a tutorial are available at eblearn.sf.net.



1. Introduction 

Character recognition in documents can be consid- 
ered a solved task for computer vision, whether hand- 
written or typed. It is however a harder problem in 
the context of complex natural scenes like photographs 
where the best current methods lag behind human per- 
formance, mainly due to non-contrasting backgrounds, 
low resolution, de-focused and motion-blurred images 
and large illumination differences (Figure 1).

[8] recently introduced a new digit classification
dataset of house numbers extracted from street level 
images. It is similar in format to the popular MNIST 
datasetQ (10 digits, 32x32 inputs), but an order of 
magnitude bigger (600,000 labeled digits), contains 
color information and various natural backgrounds. 

Previous approaches in classifying characters and 
digits from natural images used multiple hand-crafted 
features [3] and template-matching [14]. In contrast, 
ConvNets learn features all the way from pixels to the 
classifier. [8] demonstrated the superiority of learned 
features over hand-designed ones. Such superiority 






Figure 1. 32x32 cropped samples from the classifi- 
cation task of the SVHN dataset. Each sample is as- 
signed only a single digit label (0 to 9) corresponding 
to the center digit. 



was also previously shown, among others, in a traffic
sign classification challenge [13] where two independent
teams obtained the best performance against various
other approaches using ConvNets |Qj] [2]. [8] also show
superior results with unsupervised learning; however, we
only report results with fully-supervised training. We
obtain a 4.25-point improvement in accuracy (with 94.85%
accuracy) over the previous state of the art of 90.6%.
We use the traditional ConvNet architecture augmented
with different pooling methods and with multi-stage
features 0T|. This work was implemented with the EBLearn
C++ open-source framework [10].



1 http://eblearn.sf.net 



2. Architecture



The ConvNet architecture is composed of repeatedly
stacked feature stages. Each stage contains a convolution
module, followed by a pooling/subsampling module and a
normalization module. While traditional pooling modules
in ConvNets use either average or max pooling, we use
Lp pooling here. The normalization module is subtractive
only, as opposed to subtractive and divisive: the mean
value of each neighborhood is subtracted from the output
of each stage, but the result is not divided by the
standard deviation, as doing so decreases performance on
this dataset. Finally, multi-stage features are used
instead of single-stage features.
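
As a rough illustration of the subtractive-only normalization described above, the following Python sketch subtracts the mean of a local neighborhood from each value of a feature map. The function name `subtractive_norm`, the uniform 3x3 window and the clipping of windows at the borders are assumptions made for this example; the text does not specify them, and this is not the EBLearn implementation.

```python
def subtractive_norm(fmap, size=3):
    """Subtract from every value the mean of its size x size neighborhood.

    fmap is a 2D feature map given as a list of rows. The uniform window
    and the clipping at the borders are assumptions of this sketch.
    No division by the local standard deviation is performed.
    """
    h, w = len(fmap), len(fmap[0])
    half = size // 2
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            # Collect the neighborhood, clipped to the map boundaries.
            neigh = [fmap[i][j]
                     for i in range(max(0, r - half), min(h, r + half + 1))
                     for j in range(max(0, c - half), min(w, c + half + 1))]
            row.append(fmap[r][c] - sum(neigh) / len(neigh))
        out.append(row)
    return out
```

With this sketch a constant feature map is mapped to all zeros, while an isolated strong response keeps most of its magnitude and its neighbors become slightly negative.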

Figure 3. A 2-stage ConvNet architecture where
Multi-Stage features (MS) are fed to a 2-layer
classifier. The 1st stage features are branched out,
subsampled again and then concatenated to 2nd stage
features.
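
The branching described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`subsample`, `flatten`, `multi_stage_features`) are hypothetical, the extra subsampling of the 1st stage branch is taken to be a plain average over non-overlapping blocks, and the ordering of the concatenated vector is arbitrary.

```python
def subsample(fmap, factor=2):
    """Average non-overlapping factor x factor blocks of a 2D map.

    Block averaging is an assumption; the text only says the 1st stage
    features are "subsampled again".
    """
    rows = len(fmap) // factor
    cols = len(fmap[0]) // factor
    return [[sum(fmap[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(cols)]
            for r in range(rows)]


def flatten(maps):
    """Concatenate a list of 2D maps into one flat feature vector."""
    return [v for fmap in maps for row in fmap for v in row]


def multi_stage_features(stage1_maps, stage2_maps, factor=2):
    """Branch out stage-1 maps, subsample them, and append them to the
    stage-2 features to form the classifier input."""
    branch = [subsample(m, factor) for m in stage1_maps]
    return flatten(stage2_maps) + flatten(branch)
```

For example, two 4x4 stage-1 maps and three 2x2 stage-2 maps yield a classifier input of 3 * 4 + 2 * 4 = 20 values, whereas a single-stage classifier would see only the 12 stage-2 values.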



2.1. Lp-Pooling

Figure 2. L2-pooling applied to a 9x9 feature map
with a 3x3 Gaussian kernel and 2x2 stride.

Lp pooling is a biologically inspired pooling layer
modelled on complex cells (El [3 whose operation can be
summarized in equation (1), where G is a Gaussian
kernel, I is the input feature map and O is the output
feature map. It can be thought of as giving increased
weight to stronger features and suppressing weaker
ones. Two special cases of Lp pooling are notable:
P = 1 corresponds to simple Gaussian averaging, whereas
P = ∞ corresponds to max-pooling (i.e., only the
strongest signal is activated). Lp-pooling has been
used previously in [6] [15], and a theoretical analysis
of this method is described in Q.



O = \left( \sum_{i} \sum_{j} I(i,j)^{P} \times G(i,j) \right)^{1/P}    (1)



Figure 2 demonstrates a simple example of L2-pooling.
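
To make the operation concrete, here is a minimal pure-Python sketch of Lp pooling, assuming it takes the form O = (sum over the window of I^P * G)^(1/P) with a Gaussian kernel G normalized to sum to 1. The function names, the kernel width `sigma=1.0`, the use of absolute values (so that odd or fractional P is well defined) and the restriction to windows that fit entirely inside the map are all assumptions of this example rather than details taken from the text.

```python
import math


def gaussian_kernel(size=3, sigma=1.0):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    center = (size - 1) / 2.0
    k = [[math.exp(-((i - center) ** 2 + (j - center) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def lp_pool(fmap, kernel, stride=2, p=2.0):
    """Lp-pool a 2D feature map (list of rows) with a weighting kernel.

    Each output is (sum_ij |I(i,j)|^p * G(i,j))^(1/p) over one window.
    p = 1 gives a Gaussian-weighted average; p = math.inf gives the
    window maximum (max-pooling).
    """
    k = len(kernel)
    out = []
    for r in range(0, len(fmap) - k + 1, stride):
        out_row = []
        for c in range(0, len(fmap[0]) - k + 1, stride):
            # Pair every input magnitude in the window with its weight.
            window = [(abs(fmap[r + i][c + j]), kernel[i][j])
                      for i in range(k) for j in range(k)]
            if math.isinf(p):
                out_row.append(max(v for v, _ in window))
            else:
                out_row.append(sum(g * v ** p for v, g in window) ** (1.0 / p))
        out.append(out_row)
    return out
```

Applied to a 9x9 map with a 3x3 kernel and a stride of 2, as in Figure 2, `lp_pool` returns a 4x4 map. Because the kernel weights sum to 1, the output for a given window does not decrease as p grows, moving from the weighted average (p = 1) towards the maximum (p = ∞).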



2.2. Multi-Stage Features

Multi-Stage features (MS) are obtained by branching
out the outputs of all stages into the classifier
(Figure 3). They provide richer representations than
Single-Stage features (SS) by adding complementary
information such as local textures and fine details
lost by higher levels. MS features have consistently
improved performance in other work E] QJ] and in this
work as well (Figure 4). However, we observe minimal
gains on this dataset compared to other types of
objects such as pedestrians and traffic signs (Table 1).
The likely explanation for this observation is that
gains are correlated with the amount of texture and
the multi-scale characteristics of the objects of
interest.

3. Experiments

3.1. Data Preparation

The SVHN classification dataset [8] contains 32x32
images with 3 color channels. The dataset is divided
into three subsets: train set, extra set and test set.
The extra set is a large set of easy samples and the
train set is a smaller set of more difficult samples.
Since we are given no information about how the
sampling of these images was done, we assume a random
order to construct our validation set. We compose our
validation set with 2/3 from training samples (400 per
class) and 1/3 from extra samples (200 per class),
yielding a total of 6000 samples. This distribution
allows us to measure success on easy samples but puts
more emphasis on difficult ones.

Samples are pre-processed with a local contrast
normalization (with a 7x7 kernel) on the Y channel of
the YUV space, followed by a global contrast
normalization over each channel. No sample distortions
were used to improve invariance.

3.2. Architecture Details


The ConvNet has 2 stages of feature extraction and
a two-layer non-linear classifier. The first convolution
layer produces 16 features with 5x5 convolution filters
while the second convolution layer outputs 512 features
with 7x7 filters. The output to the classifier also
includes inputs from the first layer, which provides
local features/motifs to reinforce the global features.
The classifier is a 2-layer non-linear classifier with
20 hidden units. Hyper-parameters such as learning rate,
regularization constant and learning rate decay were
tuned on the validation set. We use stochastic gradient
descent as our optimization method and shuffle our
dataset after each training iteration.

For the pooling layers, we compare Lp-pooling for
the values p = 1, 2, 4, 8, 12, 16, 32, ∞ on the
validation set and use the best performing pooling for
the final test. The performance of the different pooling
methods on the validation set can be seen in Figure 5.
Insights from [ Q tell us that the optimal value of p
varies for different input spaces and there is no single
globally optimal value for p. For our validation data,
we observe that p = 2, 4, 12 give the best performance
(5.62%, 5.64% and 5.61% respectively). Max-pooling,
which corresponds to p = ∞, yielded a validation error
rate of 7.57%.

Task                                      Single-Stage features   Multi-Stage features   Improvement %
Pedestrians detection (INRIA) [9]         14.26%                  9.85%                  31%
Traffic Signs classification (GTSRB) Qj]  1.80%                   0.83%                  54%
House Numbers classification (SVHN)       5.72%                   5.67%                  0.9%

Table 1. Error rate improvements of multi-stage features over single-stage features for different types of object detection
and classification. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians but
minimal for house numbers.

Figure 4. Improvement of Multi-Stage features (MS)
over Single-Stage features (SS) in error rate on the
validation set. MS features provide a slight error
improvement over SS features.

Figure 5. Error rate of Lp-pooling on the validation
set for p = 1, 2, 4, 8, 12, 16, 32, ∞ (p = ∞ is
represented as p = 100 for convenience). These
validation errors are reported after 1000 training
epochs. p = 12 performs best with an error rate of
5.61%.

4. Results & Future Work



Our experiments demonstrate a clear advantage of
Lp pooling with 1 < p < ∞ on this dataset, both in
validation (Figure 5) and in test (average pooling is
3.58 points inferior to L2 pooling in Table 2). With L4
pooling, we obtain state-of-the-art performance on the
test set with an accuracy of 94.85%, compared to the
previous best of 90.6% (Table 2). We also show that
using multi-stage features gives only a slight increase
in performance, compared to the increase seen in other
vision applications.
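
The accuracy figures quoted above can be related to one another with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text and in Table 2; the function and variable names are illustrative. It distinguishes the absolute gain in accuracy points from the relative reduction in error rate.

```python
def relative_error_reduction(old_error, new_error):
    """Percentage by which the error rate shrinks, relative to the old error."""
    return 100.0 * (old_error - new_error) / old_error


previous_best_acc = 90.6   # accuracy (%) of the previous best method
convnet_l4_acc = 94.85     # accuracy (%) of the ConvNet with L4 pooling

# Absolute gain, in accuracy points.
absolute_gain = convnet_l4_acc - previous_best_acc

# Relative reduction of the error rate (error = 100 - accuracy).
relative_gain = relative_error_reduction(100.0 - previous_best_acc,
                                         100.0 - convnet_l4_acc)

# ConvNet / MS / L2 versus ConvNet / MS / Average (Table 2).
average_vs_l2 = 94.33 - 90.75

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative error reduction: {relative_gain:.1f}%")
print(f"L2 over average pooling: {average_vs_l2:.2f} points")
```

The error rate drops from 9.4% to 5.15%, so the 4.25-point gain in accuracy corresponds to a relative error reduction of about 45.2%.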

Additionally, it is important to note that our approach
is trained in a fully supervised manner only, whereas
the best previous methods are unsupervised learning
methods (k-means, auto-encoders). In the future, we
shall run experiments with unsupervised learning to
compare the accuracy improvement that can be attributed
to supervision. Figure 6 shows the validation samples
with highest energy. Many of these seem to exhibit large
scale variations; future work could address this problem
by introducing artificial scale deformations during
training.



Algorithm                               SVHN-Test Accuracy
Binary Features (WDCH)                  [illegible]
HOG                                     [illegible]
Stacked Sparse Auto-Encoders            [illegible]
K-Means                                 90.6%
ConvNet / MS / Average                  90.75%
ConvNet / MS / L2 / Smaller training    91.55%
ConvNet / SS / L2                       94.28%
ConvNet / MS / L2                       94.33%
ConvNet / MS / L12                      94.76%
ConvNet / MS / L4                       94.85%
Human Performance                       98.0%

Table 2. Performance reported by [8], with the addition of the supervised ConvNet results, which reach a state-of-the-art accuracy of 94.85%.




Figure 6. Preprocessed Y channel of validation samples with highest energy (i.e. highest error) with the 94.33% accuracy 
L2-pool based multi-stage ConvNet. 